Abstract

We compute the regression depth of a $k$-flat in a set of $n$ points in $\mathbb{R}^d$, in time
$O(n^{d-2} + n \log n)$ when $1 \le k \le d-2$. In contrast, the best time bound known
for the $k = 0$ case (data depth) or for the $k = d-1$ case (hyperplane regression) is
$O(n^{d-1} + n \log n)$.

1 Introduction

Regression depth was introduced by Hubert and Rousseeuw [?] as a distance-free quality
measure for linear regression. The depth of a hyperplane with respect to a set of data points
in $\mathbb{R}^d$ is the minimum number of data points crossed in any continuous motion taking the
hyperplane to a vertical hyperplane. A vertical hyperplane is a regression failure, because
it allows the response variable (that is, the dependent variable) to vary over its entire range
while keeping the explanatory variables (the independent variables) fixed. Thus a good
regression plane should be far from a vertical hyperplane. A deepest hyperplane is farthest
from vertical in a combinatorial sense; it provides a good fit even in the presence of skewed
or data-dependent errors, and is robust against a constant fraction of arbitrary outliers.

Due to its combinatorial nature, the notion of regression depth leads to many interesting
algorithmic and geometric problems. For points on the line $\mathbb{R}^1$, a median point is a point of
maximum depth. For the case of $n$ points in the plane $\mathbb{R}^2$, Hubert and Rousseeuw [?] gave
a simple construction called the catline, which finds a line of depth $\lceil n/3 \rceil$. The deepest line
in the plane can be found in time $O(n \log n)$ [?]. The catline's depth bound is best possible,
and more generally in $\mathbb{R}^d$ the best depth bound is $\lceil n/(d+1) \rceil$ [?, ?]. The fastest known
exact algorithm for maximizing depth takes time $O(n^d)$, and $\epsilon$-cutting techniques can be
used to obtain an $O(n)$-time $(1+\epsilon)$-approximation to the maximum depth [?].

In previous work [?], we generalized depth to multivariate regression, that is, fitting
points in $\mathbb{R}^d$ by affine subspaces with dimension $k < d-1$ ($k$-flats for short). We showed
that for any $d$ and $k$, deep $k$-flats always exist, meaning that for any point set in $\mathbb{R}^d$, there
is always a $k$-flat of depth a constant fraction of $n$, with the constant depending on $d$ and
$k$. This result implies that the deepest flat is robust, with a breakdown point which is a
constant fraction of $n$. We also generalized the catline construction to find lines with depth
$\lceil n/(2d-1) \rceil$, which is tight for $d \le 3$ and would be tight for all $d$ under a conjectured
$\lceil n/((k+1)(d-k)+1) \rceil$ bound on maximum regression depth. On the algorithmic side,
we showed that $\epsilon$-cuttings can be used to obtain an $O(n)$-time $(1+\epsilon)$-approximation for
the deepest flat.

In this paper, we consider the problem of testing the depth of a given flat, or more generally
the crossing distance between two flats. Rousseeuw and Struyf [10] studied similar
problems for hyperplanes and points. The crossing distance between a point and a hyperplane
can be found in time $O(n^{d-1} + n \log n)$ by examining the arrangement's restriction
to the hyperplane (as described later), and the same bound applies to testing the depth of a
hyperplane or point. We show that, in contrast, the depth of a flat of any other dimension
can be found in randomized time $O(n^{d-2} + n \log n)$. More generally, the crossing distance
between a $j$-flat and a $k$-flat can be found in time $O(n^{j+k-1} + n \log n)$ when $1 \le j, k \le d-2$.

2 Definitions 

A generic $k$-flat with $k < d-1$ can move continuously to vertical without crossing any data
points, so it is not obvious how to generalize regression depth to $k$-flats. The key is to start
from an equivalent definition of hyperplane regression depth: the depth of a hyperplane $H$
is the minimum number of data points in a double wedge with one boundary equal to $H$
and the other boundary vertical (parallel to the response variable's axis). A double wedge
is the closed region bounded by two hyperplanes; it is the region necessarily swept out by a
continuous motion of one bounding hyperplane to the other.
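
For a line in the plane, the double-wedge definition can be evaluated directly. The following is a minimal brute-force sketch, not an algorithm from this paper: the function name `regression_depth_line`, the slope/intercept parametrization, and the tolerance `tol` are assumptions made for illustration. It tries a vertical boundary strictly between each pair of consecutive distinct $x$-coordinates (and beyond both ends) and counts the points in each of the two closed double wedges.

```python
def regression_depth_line(points, slope, intercept, tol=1e-12):
    """Brute-force regression depth of the line y = slope*x + intercept.

    points: non-empty list of (x, y) pairs. Runs in O(n^2) time.
    """
    xs = sorted({x for x, _ in points})
    # Candidate vertical boundaries x = v: left of all points, between
    # consecutive distinct x-coordinates, and right of all points.
    cuts = [xs[0] - 1.0]
    cuts += [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    cuts += [xs[-1] + 1.0]
    # Signed residual of each point relative to the line.
    res = [(x, y - (slope * x + intercept)) for x, y in points]
    best = len(points)
    for v in cuts:
        left_up = sum(1 for x, r in res if x < v and r >= -tol)
        left_dn = sum(1 for x, r in res if x < v and r <= tol)
        right_up = sum(1 for x, r in res if x > v and r >= -tol)
        right_dn = sum(1 for x, r in res if x > v and r <= tol)
        # The two closed double wedges bounded by the line and x = v.
        best = min(best, left_up + right_dn, left_dn + right_up)
    return best
```

Points lying on the line have residual zero and are counted in both wedges, matching the requirement that the double wedge be closed.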

Now consider the simplest example with $k < d-1$, the regression depth of a line in
$\mathbb{R}^3$. We think of $x$ as the explanatory variable, and $y$ and $z$ as two response variables. A
regression line simultaneously explains $y$ and $z$ as linear functions of $x$, and any line parallel
to the $yz$-plane is a regression failure that allows $y$ and $z$ to vary over their entire range while
keeping $x$ fixed. We would thus like our regression line to be far from lines parallel to the
$yz$-plane. A reasonable guess at the definition of regression depth of a line $\mathcal{L}$ would be the
minimum number of data points in a double wedge with one boundary containing $\mathcal{L}$ and
the other boundary parallel to the $yz$-plane.

This guess indeed turns out to be the correct generalization; its naturalness is revealed 
by looking at the dual formulation of the problem. The projective dual of a point set is a 
hyperplane arrangement, and hyperplane regression dualizes to finding a central point in an 
arrangement. If the depth of a point p is the minimum number of arrangement hyperplanes 
crossed by any line segment from p to the hyperplane at infinity, then (as observed by 
Rousseeuw) the regression depth of a hyperplane is exactly the depth of its dual point in 
the dual arrangement. We generalized this observation to give a natural distance measure 
between flats in an arrangement [?].

Definition 1. The crossing distance between two flats in an arrangement is the fewest 
hyperplane crossings along any line segment having one endpoint on each flat. 

In the primal formulation, the crossing distance is the minimum number of points in a 
double wedge with one boundary containing one flat and the other boundary containing the 
other. 



Definition 2. The regression depth of a $k$-flat is the crossing distance between its dual
$(d-k-1)$-flat and a $k$-flat at vertical infinity.

For multivariate regression, the $k$-flat at vertical infinity should be the one dual to the
intersection of the hyperplane at infinity with the $(d-k)$-flat spanned by the response variable
axes. With this choice, regression failures have regression depth zero. For hyperplane
regression, there is no choice to make as there is only one $(d-1)$-flat at infinity.

Along with hyperplane regression, Definition 2 also subsumes the classical notion of
data depth or Tukey depth. The data depth of a point $p$ is the minimum number of data
points in any closed half-space (a degenerate double wedge) containing $p$. The data depth
of $p$ is also the crossing distance of its dual hyperplane from the point at vertical infinity.
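
To make the half-space formulation concrete, here is a minimal brute-force sketch of data depth in the plane. It is an illustration only, not the method of this paper; the name `tukey_depth_2d` is hypothetical, and the code assumes integer (or otherwise exactly comparable) coordinates so that the sign tests are exact. The count of a closed half-plane through $p$ changes only when its boundary passes through a data point, so it suffices to examine each such boundary direction and rotate it infinitesimally either way.

```python
def tukey_depth_2d(p, points):
    """Brute-force data depth of p among points (exact arithmetic assumed)."""
    px, py = p
    diffs = [(x - px, y - py) for x, y in points]
    at_p = sum(1 for d in diffs if d == (0, 0))   # always inside the half-plane
    others = [d for d in diffs if d != (0, 0)]
    if not others:
        return at_p
    best = len(points)
    for dx, dy in others:
        # Both normals of the boundary line through p and this data point.
        for nx, ny in ((-dy, dx), (dy, -dx)):
            pos = fwd = back = 0
            for ex, ey in others:
                s = nx * ex + ny * ey
                if s > 0:
                    pos += 1                       # strictly inside
                elif s == 0:
                    # On the boundary line: ahead of or behind p.
                    if dx * ex + dy * ey > 0:
                        fwd += 1
                    else:
                        back += 1
            # A tiny rotation keeps exactly one of the two boundary groups.
            best = min(best, at_p + pos + min(fwd, back))
    return best
```

This takes quadratic time; it is meant only to pin down the definition, not to compete with the bounds discussed in the text.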

3 Reduction to Covering 

We now show that crossing distance can be reduced to finding a minimally covered point in
a certain family of sets. Suppose we are given an arrangement of hyperplanes, a $j$-flat $\mathcal{F}_1$,
and a $k$-flat $\mathcal{F}_2$. We wish to determine the line segment, having one endpoint on each flat,
that crosses as few arrangement hyperplanes as possible.

We first parametrize the space of relevant line segments. Without loss of generality the
two flats do not meet (else the crossing distance is zero) so any pair of points from $\mathcal{F}_1 \times \mathcal{F}_2$
determines a unique line. The pair divides the line into two complementary line segments
(one through infinity), so we need to augment each point of $\mathcal{F}_1 \times \mathcal{F}_2$ by an additional bit
of information to specify each possible line segment. We do this topologically: $\mathcal{F}_1$ is a
projective space, having as its double cover a $j$-sphere $S_1$, and similarly the double cover of
$\mathcal{F}_2$ is a $k$-sphere $S_2$. The product $S_1 \times S_2$ supplies two extra bits of information per point,
and there is a continuous two-to-one map from $S_1 \times S_2$ to the line segments connecting the
two flats.

Now consider subdividing $S_1 \times S_2$ according to whether the corresponding line segments
cross or do not cross a hyperplane $H$ of the arrangement. The boundary between
crossing and non-crossing line segments is formed by the segments with an endpoint on a
great sphere formed by intersecting $H$ with $S_1$ or $S_2$. The line segments that cross $H$ therefore
correspond to a set $(\mathcal{H}_1 \times \mathcal{H}_2) \cup (\overline{\mathcal{H}_1} \times \overline{\mathcal{H}_2})$, where $\mathcal{H}_i$ is a hemisphere bounded by the
intersection of $H$ with $S_i$ and $\overline{\mathcal{H}_i}$ is the complementary hemisphere. A line segment
crossing the fewest hyperplanes then corresponds to a point in the fewest such sets.

For example, Figure 1 illustrates the case in which $\mathcal{F}_1$ and $\mathcal{F}_2$ are each lines. The space
of line segments with one endpoint on $\mathcal{F}_1$ and the other endpoint on $\mathcal{F}_2$ is doubly covered
by the two-dimensional torus $S_1 \times S_2$, which we have cut along two circles to show as
a square. The solid dots represent the same line segment; the hollow dots represent the
complementary line segment. Three covering sets of the form $(\mathcal{H}_1 \times \mathcal{H}_2) \cup (\overline{\mathcal{H}_1} \times \overline{\mathcal{H}_2})$ are
shown; their boundaries are shown dotted, dashed, and solid respectively, and the interiors
of the sets are shaded. (The dotted boundary happens to align with the circles that cut the
torus down to a square.)

Since the union in each set of the form $(\mathcal{H}_1 \times \mathcal{H}_2) \cup (\overline{\mathcal{H}_1} \times \overline{\mathcal{H}_2})$ is a disjoint union, we
can simplify the problem a bit by cutting each such set into two products of hemispheres.
We summarize the discussion above with a lemma.

Figure 1: Computing the crossing distance between two lines (1-flats) is equivalent to finding
a minimally covered point on the torus $S_1 \times S_2$.

Lemma 1. Computing the crossing distance between flats $\mathcal{F}_1$ and $\mathcal{F}_2$ is equivalent to finding
a point in a product of spheres $S_1 \times S_2$ that is covered by the fewest sets from a given
family of subsets, each of which is a product of hemispheres $\mathcal{H}_1 \times \mathcal{H}_2$.



4 Algorithms 

We now show how to solve the problem given in Lemma 1. We first consider the special
case of the crossing distance between a point and hyperplane, that is, $j = 0$ and $k = d-1$. In
this case, the product of spheres is a disjoint pair of $(d-1)$-spheres, both covered identically
by a family of hemispheres, so we can treat it as if it were just a single sphere. We would
like to find a point on this sphere that is covered by the fewest hemispheres. We can build
the entire arrangement of hemispheres in time $O(n^{d-1} + n \log n)$ using a slight modification
of an algorithm for computing a hyperplane arrangement [?], and compute the number of
hemispheres covering each cell by stepping from cell to cell in constant time per step. Any
minimally covered cell gives a solution.
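
In the lowest-dimensional instance the sphere is a circle and the hemispheres are closed semicircles, so the cell-to-cell stepping can be shown in a few lines. The sketch below is an illustration under assumptions that are not in the paper: angles are measured in integer ticks, a full turn is `2 * half` ticks, and the name `min_semicircle_cover` is hypothetical. Because the semicircles are closed, the minimum is attained on an open arc between consecutive endpoints, so the code counts the coverage just after the first endpoint directly and then updates it at every later endpoint.

```python
from collections import Counter

def min_semicircle_cover(starts, half):
    """Fewest closed semicircles [a, a + half] covering any point of the circle.

    starts: integer start positions; the circle has 2 * half ticks.
    """
    full = 2 * half
    starts = [a % full for a in starts]
    begin = Counter(starts)                           # arcs opening at a tick
    end = Counter((a + half) % full for a in starts)  # arcs closing at a tick
    ticks = sorted(set(begin) | set(end))
    if not ticks:
        return 0
    t0 = ticks[0]
    # Coverage of the open cell just after t0, counted directly.
    cover = sum(1 for a in starts if (t0 - a) % full < half)
    best = cover
    for t in ticks[1:]:
        # Step into the cell just after t: arcs opening at t are gained,
        # arcs closing at t are lost.
        cover += begin[t] - end[t]
        best = min(best, cover)
    return best
```

After sorting, each step costs constant time (amortized over the endpoints at a tick), which mirrors the stepping argument above for the circle only; higher dimensions need the full arrangement.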

Next let us consider the special case of the crossing distance between two lines, that is,
$j = k = 1$. The product of spheres $S_1 \times S_2$ is just a 2-torus, and the products of hemispheres
are just products of semicircles. We cut the torus into a square as in Figure 1; each product
of semicircles turns into a set of at most four rectangles. We refer to the horizontal and
vertical projections of these rectangles as segments.
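
The cutting step can be sketched as follows. This is an illustrative helper, not code from the paper: the names `cut_arc` and `cut_product` and the integer-tick parametrization (a full circle has `2 * half` ticks, and the torus is cut at tick 0 in both coordinates) are assumptions. A semicircle that wraps past the cut splits into two intervals, so a product of two semicircles yields one, two, or four rectangles.

```python
def cut_arc(start, half):
    """Split the closed semicircle [start, start + half] at the cut point 0.

    Returns one or two intervals (lo, hi) inside [0, 2 * half].
    """
    full = 2 * half
    a = start % full
    b = a + half
    if b <= full:
        return [(a, b)]
    return [(a, full), (0, b - full)]   # the arc wraps around the cut

def cut_product(start1, start2, half):
    """Rectangles (horizontal interval, vertical interval) of a product of semicircles."""
    return [(h, v) for h in cut_arc(start1, half) for v in cut_arc(start2, half)]
```

For instance, with `half = 180`, a semicircle starting at 270 becomes the two segments `(270, 360)` and `(0, 90)`.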

We can now use a standard sweep-line algorithm to compute a point in $S_1 \times S_2$ covered
by a minimum number of sets. Conceptually we sweep a vertical line from left to right
across Figure 1. We use a segment tree [?] to represent the vertical segments crossed by
the sweep line; let us assume that vertical represents $S_2$. As usual with segment trees, each
vertical segment $\mathcal{H}_2$ appears at $O(\log n)$ nodes of the segment tree, exactly those nodes
whose intervals are covered by $\mathcal{H}_2$ but whose parents' intervals are not covered by $\mathcal{H}_2$.
(Here we denote vertical segments by $\mathcal{H}_2$, even though some of them are just "halves" of
the original semicircles $\mathcal{H}_2$.)

We equip each node $v$ of the segment tree with an additional piece of information: the
minimum number of $\mathcal{H}_2$ segments covering some point in the interval corresponding to $v$.
This coverage number can be computed by taking the minimum of the numbers at $v$'s two
children and adding the number of segments listed at $v$ itself.

We sweep horizontally across the square. The events in the sweep algorithm correspond
to endpoints of segments on $S_1$. At each endpoint of a segment we update the segment tree
along with the coverage numbers at its nodes. Coverage numbers change at only $O(\log n)$
nodes: the ancestors of the nodes storing the newly inserted or deleted vertical segment.
The coverage number at the root gives the minimally covered cell currently crossed by the
sweep line. We also maintain the overall minimum covering seen so far, and update this
minimum at each event. At the end of the sweep, the overall minimum gives the answer.
We have obtained the following theorem.

Theorem 1. The crossing distance between two lines in an arrangement in $\mathbb{R}^3$, or the
regression depth of a line in $\mathbb{R}^3$, can be found in time $O(n \log n)$.
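
The sweep behind Theorem 1 can be sketched on the cut-open square. The code below is a minimal illustration under stated assumptions rather than the authors' implementation: the input is taken to be a list of closed axis-parallel rectangles `(x1, x2, y1, y2)` with integer coordinates inside a `width` by `height` square, and the names `CoverageTree` and `min_cover_of_rectangles` are hypothetical. Each tree node stores the number of segments listed at it and a coverage number equal to that count plus the minimum over its two children, as described above.

```python
class CoverageTree:
    """Segment tree over m elementary intervals with minimum-coverage numbers."""

    def __init__(self, m):
        self.m = m
        self.count = [0] * (4 * m)    # segments listed at each node
        self.mincov = [0] * (4 * m)   # fewest segments covering a point below

    def update(self, lo, hi, delta, node=1, left=0, right=None):
        """Insert (delta=+1) or delete (delta=-1) the segment [lo, hi)."""
        if right is None:
            right = self.m
        if hi <= left or right <= lo:
            return
        if lo <= left and right <= hi:
            self.count[node] += delta
        else:
            mid = (left + right) // 2
            self.update(lo, hi, delta, 2 * node, left, mid)
            self.update(lo, hi, delta, 2 * node + 1, mid, right)
        if right - left == 1:
            self.mincov[node] = self.count[node]
        else:
            self.mincov[node] = self.count[node] + min(
                self.mincov[2 * node], self.mincov[2 * node + 1])

    def min_coverage(self):
        return self.mincov[1]


def min_cover_of_rectangles(rects, width, height):
    """Fewest closed rectangles covering any point of [0,width] x [0,height]."""
    xs = sorted({0, width} | {x for r in rects for x in r[:2]})
    ys = sorted({0, height} | {y for r in rects for y in r[2:]})
    yidx = {y: i for i, y in enumerate(ys)}
    tree = CoverageTree(len(ys) - 1)
    starts, ends = {}, {}
    for x1, x2, y1, y2 in rects:
        if x1 < x2 and y1 < y2:       # degenerate rectangles cannot raise the minimum
            starts.setdefault(x1, []).append((yidx[y1], yidx[y2]))
            ends.setdefault(x2, []).append((yidx[y1], yidx[y2]))
    best = len(rects)
    for xa, xb in zip(xs, xs[1:]):    # sweep over vertical slabs, left to right
        for lo, hi in starts.get(xa, []):
            tree.update(lo, hi, +1)
        best = min(best, tree.min_coverage())
        for lo, hi in ends.get(xb, []):
            tree.update(lo, hi, -1)
    return best
```

Sorting the coordinates costs $O(n \log n)$ and each insertion or deletion touches $O(\log n)$ nodes, in line with the bound in the theorem.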

For general $j$ and $k$, we use a randomized recursive decomposition in place of the segment
tree.

Lemma 2. Given an arrangement of hyperplanes in $\mathbb{R}^d$, we can produce a recursive binary
decomposition of $\mathbb{R}^d$, with high probability in time $O(n^d + n \log n)$, such that any halfspace
bounded by an arrangement hyperplane has (with high probability) a representation as a
disjoint union of decomposition cells with $O(n^{d-1} + \log n)$ ancestors.

Proof sketch: We apply a randomized incremental arrangement construction algorithm.
Each cell in the recursive decomposition is an arrangement cell at some stage of the construction.
The bound on the representation of a halfspace comes from applying the methods
of [?, pp. 120-123] to the zone of the boundary hyperplane. ■

The same method applies essentially without change to spheres and hemispheres, so
we can apply it to the sets occurring in Lemma 1. Each product of hemispheres occurring
in Lemma 1 can be represented as disjoint unions of $O(n^{j+k-2})$ products of cells in the
product of the two recursive decompositions formed by applying Lemma 2 to $S_1$ and $S_2$.
Since there are $O(n)$ products of hemispheres, we have overall $O(n^{j+k-1})$ products of cells.

The algorithm has a similar structure to the algorithm for the case $j = k = 1$, only the
simple sweep order for processing the cells of $S_1$ is replaced by a depth-first traversal of the
recursive decomposition of $S_1$. As in the sweep algorithm, we maintain a coverage number
for each cell of the decomposition of $S_2$. The coverage number measures the fewest $\mathcal{H}_2$
hemispheres covering some point in that cell, where the $\mathcal{H}_2$ hemispheres come from pairs
$\mathcal{H}_1 \times \mathcal{H}_2$ for which $\mathcal{H}_1$ covers the current cell in the traversal of $S_1$. These numbers are
computed by taking the minimum number for the cell's two children and adding the number
of hemispheres whose decomposition uses that cell directly. When the traversal visits a cell
in $S_1$, we determine the set of hemispheres whose decomposition uses that cell, and update
the numbers for the ancestors of cells covering the corresponding hemispheres in $S_2$. Each
hemisphere product leads to $O(n^{j+k-2})$ update steps, so the total time for this traversal
is $O(n^{j+k-1})$. We also maintain the overall minimum covering seen so far, and take the
minimum with the number at the root of the decomposition of $S_2$ whenever the depth-first
traversal reaches a leaf in the decomposition of $S_1$.

When one flat, say $\mathcal{F}_1$, is a line, this method's time includes an unwanted logarithmic
factor. To avoid this factor, we return to a sweep algorithm as in the case $j = k = 1$. We sweep
across $S_1$, using the hierarchical decomposition data structure for $S_2$ in place of the segment
tree. When the traversal reaches an endpoint of an interval $\mathcal{H}_1$, we update the cells for the
corresponding hemisphere $\mathcal{H}_2$.

We summarize with the following theorem. It is likely that $\epsilon$-cuttings can derandomize
this result.

Theorem 2. The crossing distance between a $j$-flat and a $k$-flat can be found with high
probability in time $O(n^{j+k-1} + n \log n)$ for $1 \le j, k$. The depth of a $k$-flat for $1 \le k \le d-2$
can be found in time $O(n^{d-2} + n \log n)$ with high probability.
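
As a worked check of how the depth bound follows from the crossing-distance bound, recall from Definition 2 that the regression depth of a $k$-flat is the crossing distance between a $(d-k-1)$-flat and a $k$-flat. The block below substitutes $j = d-k-1$; the per-sphere cell counts $O(n^{j-1})$ and $O(n^{k-1})$ are an assumption, read off from Lemma 2 applied to $S_1$ and $S_2$ for $j, k \ge 2$.

```latex
\[
  \underbrace{O(n)}_{\text{hemisphere products}}
  \cdot \underbrace{O(n^{j-1})}_{\text{cells on } S_1}
  \cdot \underbrace{O(n^{k-1})}_{\text{cells on } S_2}
  = O(n^{j+k-1}),
  \qquad
  j = d-k-1 \;\Longrightarrow\; j+k-1 = d-2 .
\]
```

The condition $1 \le j, k$ with $j = d-k-1$ is exactly $1 \le k \le d-2$, which is the range stated for the depth of a $k$-flat.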